Abstract. For our first participation in the TREC Total Recall track, we set out to try
some basic changes to the baseline provided by the organisers: the weighting scheme, the
use of stopwords, and the number of learners that contribute to the decision of which
documents to ask the virtual assessor to review. We observed that the baseline was
extremely strong and that none of our runs significantly and consistently outperformed it.


1.  Introduction 

As  the  organizers  point  out,  the  focus  of  the  Total  Recall  Track  is  to  evaluate  methods  to 
achieve  very  high  recall,  including  methods  that  include  a  human  assessor  in  the  loop  [6]. 

We submitted six automated runs for the small At Home task and provided scripts for
the sandbox evaluation. The athome1 data contains 290099 files grouped in 115 folders,
with 333 files in the smallest, 16850 in the largest, a mean of 2522.6 files per folder and a
median of 2228. Table 1 shows the 10 topics used for the athome1 part.

Two other collections were tested in the sandbox environment: the MIMIC II clinical
dataset^1 (C) and the Kaine Email Collection^2 (Kaine), with 31174 and 401953 indexed
documents, respectively.


^1 https://physionet.org/mimic2/mimic2_clinical_overview.shtml
^2 http://www.virginiamemory.com/collections/kaine/


Table 1. athome1 topics

Topic        Information Need
athome100    School and Preschool Funding
athome101    Judicial Selection
athome102    Capital Punishment
athome103    Manatee Protection
athome104    New medical schools
athome105    Affirmative Action
athome106    Terri Schiavo
athome107    Tort Reform
athome108    Manatee County
athome109    Scarlet Letter Law


2.  Approach


For this year’s participation, we only modified the provided Baseline Model
Implementation (bmi) [2] to test some very simple changes to the method. Namely, we looked
at the use of stopwords (runs indicated by the presence of an “S” in their name), the use
of a different term weighting scheme (a recently introduced adaptation of BM25 [3] that
does not need optimizing for the b parameter), and a modified voting of 6 learners instead
of using only 1. We discuss the weighting in the following subsection. The learner voting
mechanism appears in Section 2.2.

The common pre-processing steps are fundamentally the same as those of the bmi. We only
counted the additional data needed by the adapted BM25.

tokenisation: tokens are identified by first splitting on non-alphanumeric characters
and then removing all strings thus obtained that contain at least one digit. All
tokens of length 1 were ignored.
casing: all tokens were lowercased.
stopwords: only for runs marked by “S”, tokens appearing in the list in Appendix A
were ignored.
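
The pre-processing steps listed above can be sketched as follows. This is a minimal illustration, not the bmi's own code: the function name `tokenize`, the ASCII-only reading of "alphanumeric", and the optional `stopwords` argument are assumptions made for the example.

```python
import re


def tokenize(text, stopwords=None):
    """Tokenise text following the steps described above (illustrative sketch).

    `stopwords` is a hypothetical optional set, used only for the "S" runs.
    """
    tokens = []
    # Split on runs of non-alphanumeric characters (ASCII assumed here).
    for candidate in re.split(r"[^A-Za-z0-9]+", text):
        if len(candidate) <= 1:
            continue  # drops empty strings and tokens of length 1
        if any(ch.isdigit() for ch in candidate):
            continue  # drops strings containing at least one digit
        token = candidate.lower()
        if stopwords is not None and token in stopwords:
            continue
        tokens.append(token)
    return tokens


print(tokenize("The TREC-2015 track, a B2B e-mail!", stopwords={"the"}))
```

Under this splitting rule a contraction such as “don’t” is broken into “don” and a discarded single letter, so the apostrophe entries of a stopword list would never match in this sketch; how the original scripts treated them is not stated.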


2.1. Term weighting. The bmi used the basic tf.idf weighting scheme, as given by:


(1)  $weight_T(t, d) = (1 + \log(tf_{t,d})) \cdot \log(N / df_t)$,


where t is a term, d a document, $tf_{t,d}$ the term frequency, $df_t$ the document frequency, and
N is the number of documents in the collection.
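
As a small worked illustration of Eq. 1, the sketch below computes the tf.idf weight of one term in one document. The function name `tfidf_weight`, the use of the natural logarithm, and the choice to return 0 for absent terms are assumptions of this example; the text does not specify them.

```python
import math


def tfidf_weight(tf, df, n_docs):
    """Weight of a term per Eq. 1: (1 + log tf) * log(N / df).

    Natural logarithms are assumed; a term that does not occur gets weight 0.
    """
    if tf <= 0 or df <= 0:
        return 0.0
    return (1.0 + math.log(tf)) * math.log(n_docs / df)


# A term occurring once in the document and in 10 of 1000 documents.
print(tfidf_weight(tf=1, df=10, n_docs=1000))
```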

For our weighting, we used a recent observation by Lipani et al. [3]: using the
average term frequency in a document and the mean average term frequency over the
collection, we can define for BM25 a b parameter that is collection specific. This is
particularly useful here, since it saves us some training effort. The weight used,
marked by “B” in the run names, is given by:


(2)  $weight_B(t, d) = \frac{tf_{t,d}}{k_1 \left( \frac{1}{mavgtf} \cdot \frac{avgtf_d}{mavgtf} + \left(1 - \frac{1}{mavgtf}\right) \frac{L_d}{avgdl} \right) + tf_{t,d}} \cdot \log \frac{N - df_t + 0.5}{df_t + 0.5}$,


where, apart from the variables already presented for Eq. 1, $k_1$ is the usual BM25 parameter
controlling tf normalization, and $avgtf_d$ and $L_d$ are the average term frequency in document
d and its length, respectively. Finally, mavgtf and avgdl are the mean average term
frequency and the average document length, calculated over all documents. Throughout
the experiments, $k_1$ was maintained at its “standard” value of 1.2. We note that there exists
previous work that removes the need for optimizing $k_1$ [4], which could be applied here.
At this time, it remains for future work.
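
The adapted BM25 weight can be sketched as below, assuming the length normalization term of Eq. 2 is (1/mavgtf)(avgtf_d/mavgtf) + (1 - 1/mavgtf)(L_d/avgdl). The function names, the natural logarithm, and the reading of the average term frequency of a document as its length divided by its number of distinct terms are assumptions of this example, not details confirmed by the text.

```python
import math


def collection_stats(docs):
    """Return (mavgtf, avgdl) for a list of token lists.

    Assumption: avgtf_d = document length / number of distinct terms;
    empty documents are skipped.
    """
    docs = [d for d in docs if d]
    avgtfs = [len(d) / len(set(d)) for d in docs]
    mavgtf = sum(avgtfs) / len(avgtfs)
    avgdl = sum(len(d) for d in docs) / len(docs)
    return mavgtf, avgdl


def bm25_adapted_weight(tf, df, n_docs, avgtf_d, doc_len, mavgtf, avgdl, k1=1.2):
    """Weight of a term per Eq. 2, with k1 at its standard value of 1.2."""
    if tf <= 0:
        return 0.0
    # Collection-specific normalization that replaces the usual b parameter.
    norm = (1.0 / mavgtf) * (avgtf_d / mavgtf) + (1.0 - 1.0 / mavgtf) * (doc_len / avgdl)
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    return tf / (k1 * norm + tf) * idf


mavgtf, avgdl = collection_stats([["a", "a", "b", "b"], ["c", "d"]])
print(mavgtf, avgdl)
print(bm25_adapted_weight(tf=2, df=10, n_docs=1000, avgtf_d=2.0,
                          doc_len=100, mavgtf=2.0, avgdl=100.0))
```

For a document of average length whose average term frequency equals mavgtf, the normalization term is 1/mavgtf + (1 - 1/mavgtf) = 1, so the weight reduces to tf / (k1 + tf) times the idf component.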


2.2. Learner voting. The bmi uses the Sofia ML suite for incremental machine learning
algorithms [7]. In particular, it uses logreg-pegasos, i.e. Logistic Regression with Pegasos
updates, optimizing over ROC area, with 200k iterations, dimensionality 1.1 million, and
λ = 10^{-4}. These were all parameters established by the track organisers, and while we
experimented with them at times, we found no compelling reason to change them.

The only change we made was at a higher level. The Sofia ML library provides 5 more
ML algorithms. The following list is quoted directly from the manual; we refer the reader
to the website^3 and D. Sculley’s publications [7] for further details.

pegasos: Use the Pegasos SVM learning algorithm. --lambda sets the regularization
parameter, with values closer to zero giving less regularization. Note that Pegasos
enforces a hard constraint that the model weight vector must lie within an L2 ball
of radius at most 1/sqrt(lambda). Also relies on --eta_type.
sgd-svm: Use the SGD-SVM learning algorithm. --lambda sets the regularization
parameter, with values closer to zero giving less regularization. Also relies on
--eta_type.
passive-aggressive: Use the Passive Aggressive Perceptron learning algorithm.
--passive-aggressive-c sets the largest step size to be taken on any
update step; this operates as a capacity term with values closer to zero encouraging
simpler models. --passive-aggressive-lambda will force the model weight
vector to lie within an L2 ball of radius 1/sqrt(passive-aggressive-lambda).
margin-perceptron: Use the Perceptron with Margins algorithm. Sets the update
margin with --perceptron-margin-size. When set to 0, this is exactly equivalent
to the classical Perceptron by Rosenblatt. When set to 1, this is equivalent
to optimizing SVM hinge-loss without regularization. Increasing values may give
additional tolerance to noise. Also relies on --eta_type.
romma: Use the ROMMA algorithm. No parameters to set.
logreg-pegasos: Use Logistic Regression with Pegasos updates; we optimize logistic
loss and enforce Pegasos-style regularization and constraints, with --lambda being
the regularization parameter. Also relies on --eta_type. Note that the classification
values provided by this method regression are logodds, and can be converted
to probabilities using: exp(p) / (1 + exp(p)).
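
The conversion from log-odds to probabilities quoted in the last item, exp(p) / (1 + exp(p)), can be written as below. The function name is an assumption of this example, and the two-branch form is only a numerical precaution against overflow for large |p|; it computes the same quantity.

```python
import math


def logodds_to_probability(p):
    """Convert a log-odds score p to a probability, exp(p) / (1 + exp(p))."""
    if p >= 0:
        # Equivalent form that avoids computing exp of a large positive number.
        return 1.0 / (1.0 + math.exp(-p))
    e = math.exp(p)
    return e / (1.0 + e)


print(logodds_to_probability(0.0))
```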

The runs using all six learners (denoted by “6” in their name), during each iteration
of the bmi, first take all the documents on which all learners agree, then those on which 5
agree, then those on which 4 agree. After that, they complete the set with the documents
proposed by logreg-pegasos that are not yet in the set to be sent for evaluation. This was
especially necessary in the first few iterations, where extremely little agreement was found
between learning methods.
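
The voting step described above can be sketched as follows. This is an illustration under stated assumptions rather than our actual script: the function name `select_batch`, the representation of each learner's output as a ranked list of proposed document ids, the fixed `batch_size`, and the alphabetical tie-breaking within an agreement level are all hypothetical.

```python
from collections import Counter


def select_batch(proposals, fallback_learner, batch_size, min_agreement=4):
    """Pick documents for review by learner agreement (illustrative sketch).

    proposals: dict mapping learner name -> ranked list of proposed doc ids.
    Documents proposed by all learners come first, then those with one vote
    fewer, down to `min_agreement`; the rest is filled from `fallback_learner`.
    """
    votes = Counter()
    for docs in proposals.values():
        votes.update(set(docs))

    selected, seen = [], set()
    for level in range(len(proposals), min_agreement - 1, -1):
        # Sorting only makes the order deterministic (an assumption).
        for doc in sorted(d for d, v in votes.items() if v == level):
            if len(selected) == batch_size:
                return selected
            selected.append(doc)
            seen.add(doc)

    # Complete the set with the fallback learner's remaining proposals.
    for doc in proposals[fallback_learner]:
        if len(selected) == batch_size:
            break
        if doc not in seen:
            selected.append(doc)
            seen.add(doc)
    return selected


example = {
    "logreg-pegasos": ["d1", "d2", "d3", "d4"],
    "pegasos": ["d1", "d2", "d5", "d6"],
    "sgd-svm": ["d1", "d2", "d7", "d8"],
    "passive-aggressive": ["d1", "d2", "d9", "d3"],
    "margin-perceptron": ["d1", "d5", "d3", "d2"],
    "romma": ["d1", "d3", "d10", "d11"],
}
print(select_batch(example, "logreg-pegasos", batch_size=4))
```

In the example, d1 has six votes, d2 five and d3 four, so they are taken in that order, and d4 is added from the logreg-pegasos list to fill the batch.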


3.  Results 

For each recall value, we performed an ANOVA to test the omnibus hypothesis that all
the runs are equal by Precision. In most cases, and particularly for high recall values, this
hypothesis could not be rejected and therefore we cannot say that any of the runs are
actually different from the baseline. Where the hypothesis was rejected, we performed pairwise
tests and we report those. All tests were performed using Carterette’s R implementation
[1].

^3 https://github.com/huitseeker/sofia-ml


       athome1                     C                           Kaine
       recall 0.95  recall 1.00    recall 0.95  recall 1.00^a  recall 0.95  recall 1.00
run    prec   f1    prec   f1      prec   f1    prec   f1      prec   f1    prec   f1
bmi    0.51   0.60  0.05   0.09    0.46   0.59                 0.57   0.70         0.32
1NB    0.52   0.60  0.06   0.10    0.44   0.58                 0.55   0.69         0.31
1SB    0.52   0.60  0.05   0.09    0.45   0.58                 0.55   0.69         0.31
1ST    0.50   0.60  0.04   0.08    0.45   0.58  0.25   0.37    0.57   0.70  0.21   0.32
6NB    0.52   0.61  0.05   0.09    0.44   0.57                 0.56   0.69         0.31
6SB    0.47   0.56  0.05   0.08    0.44   0.58                 0.57   0.70         0.31
6ST    0.49   0.58  0.04   0.08    0.45   0.58                 0.58   0.71         0.32

^a For all topics, after rounding up to the nearest 0.01 (topic C11 was rounded up from 0.9945).


The sandbox C collection (Mimic 2 Clinical Decision Dataset) presents a strange
behaviour in the sense that it never reaches recall 1. From our own data, we see that all runs
and all topics reach a maximum of exactly 31174 documents, which is the basis for our
assumption that the dataset contains this many documents. However, the User Guide of the
dataset* states that the April 2011 release (version 2.6) contains around 33k patients.
Apparently, the difference consists of documents without any text. Multiplying the reported
recall with the known number of relevant documents per topic, we observe that, while
exploring the maximum set of 31174 indexed documents, we are missing, on average, 22.79
documents per topic. For most topics this is above 0.995 recall and therefore would be 1.00
when rounding to the nearest hundredth (10^{-2}). For topic C11 however, the total number of
relevant documents is only 180, and one missing document results in a recall of 0.9945, which
rounds to 0.99. For plotting, we forced this to 1.00 as well, to maintain visibility.
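
The rounding effect for topic C11 can be checked with a few lines of arithmetic. The helper name `rounded_recall` and the second example with 4000 relevant documents are hypothetical and serve only to contrast a small topic with a large one.

```python
def rounded_recall(n_relevant, n_missing, digits=2):
    """Return (exact recall, recall rounded to `digits` decimals)."""
    recall = (n_relevant - n_missing) / n_relevant
    return recall, round(recall, digits)


# Topic C11: 180 relevant documents, one of them never retrieved.
print(rounded_recall(180, 1))
# A hypothetical larger topic: one missing document no longer shows after rounding.
print(rounded_recall(4000, 1))
```

With 180 relevant documents, a single missing one gives 179/180, which is roughly 0.994 and rounds to 0.99, whereas for a topic with thousands of relevant documents the same loss still rounds to 1.00.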

Another data alteration we make for consistency is to assign recall 1.05 to the effort and
precision results reported when using the entire dataset. Therefore, when we talk about
recall 1.00, we refer to the first time this recall value was obtained. Otherwise, we will talk
about recall on the dataset (which is generally 1.00 except for topic C11 mentioned above).


Figures 1 and 2 show the precision-recall curves for the three test collections and for
each topic, respectively. For athome1, the curves are statistically indistinguishable, except
for points at recall 30% and 50%. For C, the curves are completely indistinguishable, and
for Kaine, the Webis is significantly lower than the other runs.

* https://physionet.org/mimic2/UserGuide/node15.html


[Figure 1 plot omitted. Axis label: recall. Legend: bmi, TUW_1NB, TUW_1SB, TUW_1ST, TUW_6NB, TUW_6SB, TUW_6ST.]

Figure 1. Average Precision-Recall curve for each collection. Points beyond
100% recall represent precision after the entire dataset was evaluated.


By Precision-Recall curve, probably the most interesting topic is athome109 (Scarlet
Letter Law), as its precision increases with recall almost up to recall 1. The topic had 506
relevant documents in the collection. The information need is, presumably, any information
about laws whose main or side effect is a public shaming of individuals, but it may
also refer only to a specific law passed and then repealed in Florida. A quick grep
on the collection shows that there are 22 documents actually containing the two words
“scarlet” and “letter” separated by 3 characters or less. All of these documents also
contain the term “Jeb” (case sensitive, representing the first name of former Florida governor
Jeb Bush). Individually, “scarlet” (ignoring case and surrounded by non-alphanumeric
characters) appears in 70 documents, 66 of which also contain the term “Jeb”. The 4 other
documents refer to people named Scarlet (a more common spelling of this name is with a
double ending consonant). To get from these 70 to the total of 506, the system has to figure it
out using only “letter” and “law”, two relatively common terms.
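
The quick checks described in this paragraph can be approximated with regular expressions. The sketch below is not the command we ran: the function name `count_scarlet`, the assumption that “scarlet” precedes “letter”, the case-insensitive matching of the pair, and the toy documents are all choices made for the example.

```python
import re

# "scarlet" followed by "letter" with at most 3 characters in between.
PAIR = re.compile(r"scarlet.{0,3}letter", re.IGNORECASE | re.DOTALL)
# "scarlet" as a whole word, i.e. not adjacent to alphanumeric characters.
WORD = re.compile(r"(?<![A-Za-z0-9])scarlet(?![A-Za-z0-9])", re.IGNORECASE)


def count_scarlet(docs):
    """Count documents matching each pattern, with and without "Jeb"."""
    counts = {"pair": 0, "pair_with_jeb": 0, "word": 0, "word_with_jeb": 0}
    for text in docs:
        has_jeb = "Jeb" in text  # case-sensitive, as in the text above
        if PAIR.search(text):
            counts["pair"] += 1
            counts["pair_with_jeb"] += has_jeb
        if WORD.search(text):
            counts["word"] += 1
            counts["word_with_jeb"] += has_jeb
    return counts


toy_docs = [
    "Jeb, the Scarlet Letter law was repealed.",
    "Scarlett wrote a letter to Jeb.",
    "Scarlet: a colour",
    "scarlet -- letter",
]
print(count_scarlet(toy_docs))
```

The second toy document shows why the word-boundary check matters: the name spelled with a double consonant matches neither pattern.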

Figures 3 and 4 show the recall versus effort, as calculated by the organizers. Figure 3
shows the average over all topics, by collection, run, and coefficients a and b. Figure 4
shows details for each topic, for a fixed b = 0.


[Figure 2 plot omitted. Axis label: precision.]

Figure 2. Precision-Recall curve for each topic and all runs. Points beyond
100% recall represent precision after the entire dataset was evaluated.


[Figure 3 plot omitted. x-axis: a (values 1, 2, 4); y-axis ticks from 60% to 100%. Legend: TUW_1NB, TUW_1SB, TUW_1ST, TUW_6NB, TUW_6SB, TUW_6ST.]

Figure 3. Average Effort. The b coefficient is in the title of each sub-plot.


4.  Conclusion

Our submissions this year did not improve upon the provided baseline. This appears
to have been the general observation of this year’s track: “Several manual and automatic
participant efforts achieve higher recall with less effort than the baseline on some topics,
but none consistently improves on the baseline” [5]. In our case, the use of stopwords
appears to be counterproductive, even though we used very few of them. The modified
weighting scheme showed insignificant improvements on athome1. The use of multiple
learners increased processing time significantly without having any (positive) effect on
effectiveness.


Appendix  A.  Stop  words  list 

a, about, above, after, again, against, all, am, an, and, any, are, aren’t,
as, at, be, because, been, before, being, below, between, both, but, by, can’t,
cannot, could, couldn’t, did, didn’t, do, does, doesn’t, doing, don’t, down,
during, each, few, for, from, further, had, hadn’t, has, hasn’t, have, haven’t,
having, he, he’d, he’ll, he’s, her, here, here’s, hers, herself, him, himself,
his, how, how’s, i, i’d, i’ll, i’m, i’ve, if, in, into, is, isn’t, it, it’s,
its, itself, let’s, me, more, most, mustn’t, my, myself, no, nor, not, of, off,
on, once, only, or, other, ought, our, ours, ourselves, out, over, own, same,
shan’t, she, she’d, she’ll, she’s, should, shouldn’t, so, some, such, than,
that, that’s, the, their, theirs, them, themselves, then, there, there’s, these,
they, they’d, they’ll, they’re, they’ve, this, those, through, to, too, under,
until, up, very, was, wasn’t, we, we’d, we’ll, we’re, we’ve, were, weren’t,
what, what’s, when, when’s, where, where’s, which, while, who, who’s, whom,
why, why’s, with, won’t, would, wouldn’t, you, you’d, you’ll, you’re, you’ve,
your, yours, yourself, yourselves


[Figure 4 plot omitted. Panel titles: C06 to C19, open, record, restricted, tech; x-axis: a (values 1, 2, 4); y-axis ticks from 0.00 to 1.00.]

Figure 4. Effort for each topic for b=0.